Clipping Outliers: NYC hotel pricing dataset analysis¶
When analyzing a dataset, some data points can exert undue influence on models built from it, such as linear regression. These data points are called outliers. Outliers can give a misleading picture of the data and degrade model performance.
What are outliers?
In data science, outliers are values within a dataset that differ greatly from the others: they are either much larger or much smaller. Outliers can appear in a dataset due to measurement variability, data errors, experimental error, etc. When included in the training data, outliers can cause machine learning models to make inaccurate predictions, so they usually need to be handled before training a model.
One of the best ways to understand outliers is the box plot.
Box plots are very useful for seeing the distribution of a variable/feature and detecting outliers in it. A box plot is a graphical representation that describes the behavior of the data in the middle as well as at both ends of the distribution. It shows the data based on the five-number summary:
- Minimum: the lowest data point in a variable excluding any outliers
- Median (Q2 or 50th percentile): the middle value in the variable
- First quartile (Q1 or 25th percentile): also known as the lower quartile (0.25)
- Third quartile (Q3 or 75th percentile): also known as the upper quartile (0.75)
- Maximum: the highest data point in the variable excluding any outliers
Interquartile Range:
The difference between the upper quartile and the lower quartile (Q3 - Q1) is called the interquartile range or IQR.
Box plots help us find the outliers in the data by using the IQR. As a rule, values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are regarded as outliers.
In a box plot, the points that lie outside the whisker lines are the outliers.
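The 1.5*IQR rule can be sketched in a few lines of Python. The values list below is made up for illustration, and the quartiles are computed with the standard library's statistics.quantiles using the "inclusive" method; other tools may use slightly different quartile conventions, so treat this as a sketch rather than the definition a particular plotting library uses.

```python
import statistics

# Illustrative values only (not taken from the Airbnb dataset)
values = [10, 12, 13, 15, 16, 18, 19, 21, 95]

# Quartiles of the sample; "inclusive" interpolates between data points
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Fences: anything beyond them is treated as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print("Q1:", q1, "median:", median, "Q3:", q3, "IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

Here only the value 95 falls outside the fences, so it is the only point a box plot drawn with this rule would show beyond the whiskers.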
There are different techniques to handle outliers in a dataset. In our example, we will use the concept of clipping (winsorizing).
What is winsorizing/clipping?
Clipping a dataset means capping its values at the last permitted extreme value, e.g. the 5th or 95th percentile value. For example, when we clip the data at the 95th percentile, every value above the 95th percentile is set to the 95th percentile value.
The following data set has several extremes (0.1, 1, 99 and 125):
- {0.1, 1, 12, 14, 16, 18, 19, 21, 24, 26, 29, 32, 33, 35, 39, 40, 41, 44, 99, 125}
After clipping/winsorizing the top and bottom 10% of the data (replacing those values with the nearest remaining value), we get:
- {12, 12, 12, 14, 16, 18, 19, 21, 24, 26, 29, 32, 33, 35, 39, 40, 41, 44, 44, 44}
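The worked example above can be reproduced with a small helper. The winsorize function below is a hypothetical name written for this illustration (it is not a pandas function); it replaces a fixed fraction of the values at each end with the nearest value that is kept.

```python
def winsorize(values, fraction):
    """Replace the lowest and highest `fraction` of values with the
    nearest value that is kept. Hypothetical helper for illustration."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # how many values to replace at each end
    low = ordered[k]                  # smallest value that is kept
    high = ordered[-k - 1]            # largest value that is kept
    return [min(max(v, low), high) for v in values]


data = [0.1, 1, 12, 14, 16, 18, 19, 21, 24, 26,
        29, 32, 33, 35, 39, 40, 41, 44, 99, 125]

# 10% of 20 values is 2, so two values are replaced at each end
clipped = winsorize(data, 0.10)
print(clipped)
```

With 20 values and a fraction of 10%, the two smallest values become 12 and the two largest become 44, while every other value is left unchanged.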
Let us solve a problem that removes outliers from data using clipping.
Problem Description¶
To illustrate the clipping method, let's look at an example.
We have a dataset named nyc_airbnb.csv, which contains data about the per-night price of Airbnb rentals. The price variable in this dataset contains some outliers. Our task is to find the outliers and cap them by winsorizing/clipping.
First, we load our dataset into a dataframe and view it.
Load the Dataset and View data¶
Step 1: Import the pandas library as pd
import pandas as pd
Step 2: Load the data into a variable nyc using the read_csv method in pandas
nyc = pd.read_csv("../datasets/nyc_airbnb.csv")
Step 3: View the variable nyc.
nyc
| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48890 | 36484665 | Charming one bedroom - newly renovated rowhouse | 8232441 | Sabrina | Brooklyn | Bedford-Stuyvesant | 40.67853 | -73.94995 | Private room | 70 | 2 | 0 | NaN | NaN | 2 | 9 |
| 48891 | 36485057 | Affordable room in Bushwick/East Williamsburg | 6570630 | Marisol | Brooklyn | Bushwick | 40.70184 | -73.93317 | Private room | 40 | 4 | 0 | NaN | NaN | 2 | 36 |
| 48892 | 36485431 | Sunny Studio at Historical Neighborhood | 23492952 | Ilgar & Aysel | Manhattan | Harlem | 40.81475 | -73.94867 | Entire home/apt | 115 | 10 | 0 | NaN | NaN | 1 | 27 |
| 48893 | 36485609 | 43rd St. Time Square-cozy single bed | 30985759 | Taz | Manhattan | Hell's Kitchen | 40.75751 | -73.99112 | Shared room | 55 | 1 | 0 | NaN | NaN | 6 | 2 |
| 48894 | 36487245 | Trendy duplex in the very heart of Hell's Kitchen | 68119814 | Christophe | Manhattan | Hell's Kitchen | 40.76404 | -73.98933 | Private room | 90 | 7 | 0 | NaN | NaN | 1 | 23 |
48895 rows × 16 columns
Check for outliers in price data¶
Since we want to find the outliers in the price, an effective approach is to use visualizations. A strip plot is a very useful graph for seeing how the data points are spread and where the outliers lie in the price data.
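A numeric check can complement the plot. The sketch below applies the 1.5*IQR rule to a price column with pandas. The nyc_sample dataframe and its values are a made-up stand-in for the real nyc dataframe, so the result says nothing about the actual data.

```python
import pandas as pd

# Hypothetical stand-in for the price column of the nyc dataframe
nyc_sample = pd.DataFrame(
    {"price": [149, 225, 150, 89, 80, 70, 40, 115, 55, 90, 10000]}
)

# Quartiles and IQR of the price column
q1 = nyc_sample["price"].quantile(0.25)
q3 = nyc_sample["price"].quantile(0.75)
iqr = q3 - q1

# Rows whose price lies outside the 1.5*IQR fences
outside = (nyc_sample["price"] < q1 - 1.5 * iqr) | (nyc_sample["price"] > q3 + 1.5 * iqr)
flagged = nyc_sample[outside]
print(flagged)
```

On the real data, the same mask could be built from nyc["price"] to count how many listings fall beyond the fences before deciding where to clip.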
Plot a strip plot for outlier estimation:¶
For our strip plot, we visualize every data point of the price data. We look at the spread of the price data on the y axis. For this plot, we import the plotly express library.
Step 1: Import the plotly.express library as px
import plotly.express as px
Step 2: Using px, call the strip() method to generate the strip plot
- Inside the method, the parameters will be:
  - nyc: variable where the data is stored
  - y='price': column data to plot on the y axis
- Store the result in a variable price_strip, which will hold the plot
price_strip = px.strip(nyc, y='price')
Step 3: Display the variable price_strip using the show() method
price_strip.show()